This project explores and analyzes red wine quality. The underlying objective is to understand which chemical properties influence the quality of red wines. The analysis is carried out in R, a statistical programming language; the dataset can be found here and additional literature on the variables can be found here.
The following are some basic statistics on the dataset and the quality variable.
# Summary Statistics
str(wq)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(wq)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
summary(wq$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The dataset holds 1,599 wine observations across 13 numeric variables. X appears to be a unique identifier, and quality is the primary output. Quality is scored on a 10-point scale and was rated by at least three wine experts. Interestingly, the observed scores range only from 3 to 8, with a mean of 5.6 and a median of 6. Because the scores are integers on a ranked scale, quality is a discrete, ordinal variable.
table(wq$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The following are histogram plots for the 12 variables to kick off the data visualizations.
There are 1,599 wine observations across 13 numeric variables, where X is the unique identifier and the remaining 12 (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality) are the features.
The first 11 are physicochemical measurements on the wine samples, and quality is a score on a 10-point scale based on sensory data from at least three wine experts.
The main feature of interest is quality. From the Univariate Plots Section, it can be observed that quality follows a near-normal distribution in which the bulk of the observations fall in the 5-6 range, with some outliers on either end. This can be further outlined with a coarser rating variable, such that a quality score of 0-4 denotes a Poor wine, a score of 5-6 denotes an Average wine, and a score of 7 or more denotes a Good wine.
## Poor Average Good
## 63 1319 217
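The bucketing described above can be sketched with base R's `cut()`. This is a minimal illustration rather than the original chunk: the quality vector is rebuilt from the counts printed by `table(wq$quality)`, and the exact break points are an assumption chosen to match the 0-4, 5-6 and 7+ ranges.

```r
# Quality counts copied from the table(wq$quality) output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild a vector with the same distribution (a stand-in for wq$quality).
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Assumed break points: [0, 4] is Poor, (4, 6] is Average, (6, 10] is Good.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("Poor", "Average", "Good"),
              include.lowest = TRUE,
              ordered_result = TRUE)

rating_counts <- table(rating)
print(rating_counts)
```

With these break points the counts come out as 63 Poor, 1319 Average and 217 Good, matching the output shown above.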
Throughout this exploratory data analysis, the drivers of quality will be unearthed and examined.
Similar to quality, density and pH seem to be normally distributed. Fixed and volatile acidity, free and total sulphur dioxide, sulphates, and alcohol seem to be skewed and long-tailed. It is unclear which features directly affect quality, but some high-level research suggests that alcohol content, acidity and pH might contribute.
Further research failed to highlight any difference in benefit between the different types of acidity in wine. Thus, for the purpose of this project, fixed acidity (tartaric acid), volatile acidity (acetic acid) and citric acid were combined into a single variable named acidity. It should also be noted that the presence of sulphur dioxide and sulphates could suggest the presence of sulphuric acid; this is ignored as being beyond the scope of this project.
A new variable, rating, was defined to categorize the wine quality scores into Poor, Average, and Good buckets and to illustrate their near-normal distribution. Lastly, a key variable, acidity, was defined as the sum of fixed acidity, volatile acidity and citric acid. It is hypothesized that acidity is a driver of wine quality.
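As a minimal sketch of the combined variable, the sum can be computed column-wise. The data frame `wq_head` is a hypothetical stand-in holding only the first ten rows that appear in the `str(wq)` output; the original code is not shown in this report.

```r
# First ten rows of three columns, copied from the str(wq) output above.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# acidity is the row-wise sum of the three acid measurements.
wq_head$acidity <- with(wq_head, fixed.acidity + volatile.acidity + citric.acid)

print(wq_head)
```

Note that the sum treats the three acids as interchangeable, which is exactly the assumption the later boxplots call into question.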
The distribution of citric acid is fairly unusual, given that the distributions of fixed acidity and volatile acidity on a logarithmic scale resemble the normal distribution of pH. It appears that citric acid has a large number of zero values, which could reflect incomplete or unavailable data.
The dataset in general was fairly tidy such that additional wrangling was not needed.
The bivariate plots begin with a scatterplot matrix. Unfortunately, due to the large file size, generating such a plot took much too long. Instead, a sample of the dataset was used to begin the exploration.
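Drawing a random subset of rows can be sketched as follows. Everything here is illustrative: `wq_demo` is a placeholder data frame filled with random numbers (not the wine data), and the sample size and seed are arbitrary assumptions, since the report does not state the values that were used.

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

# Placeholder data with the same number of rows as the wine dataset.
# The values are random and carry no meaning.
wq_demo <- data.frame(
  X       = 1:1599,
  alcohol = runif(1599, min = 8.4, max = 14.9),
  quality = sample(3:8, 1599, replace = TRUE)
)

sample_size <- 500  # hypothetical; not stated in the report

# Sample row indices without replacement, then subset the data frame.
wq_sample <- wq_demo[sample(nrow(wq_demo), sample_size), ]

print(dim(wq_sample))
```

The sampled data frame can then be passed to whichever scatterplot-matrix function is in use in place of the full dataset.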
The boxplots on rating and some of the correlations seem noteworthy. They were subsequently explored.
These boxplots provided some very interesting insights. It appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The difference in behavior among the acids does bring into question the decision to use a combined acidity variable, but a better assessment will be made in a subsequent section.
##                    X        fixed.acidity     volatile.acidity
##           0.06645261           0.12405165          -0.39055778
##          citric.acid       residual.sugar            chlorides
##           0.22637251           0.01373164          -0.12890656
##  free.sulfur.dioxide total.sulfur.dioxide              density
##          -0.05065606          -0.18510029          -0.17491923
##                   pH            sulphates              alcohol
##          -0.05773139           0.25139708           0.47616632
##              quality               rating              acidity
##           1.00000000           0.81236704           0.10375373
##                    X        fixed.acidity     volatile.acidity
##           0.11527163           0.11423756          -0.39124918
##          citric.acid       residual.sugar            chlorides
##                  NaN           0.02353331          -0.17613996
##  free.sulfur.dioxide total.sulfur.dioxide              density
##          -0.05008749          -0.17014272          -0.17517368
##                   pH            sulphates              alcohol
##          -0.05757386           0.30864193           0.47698109
##              quality               rating              acidity
##           0.97556915           0.79200148           0.09282597
Correlation tests were performed on both a plain and a logarithmic (log10) scale. As expected, citric acid, alcohol and, to a lesser extent, fixed acidity were positively correlated with quality, while volatile acidity was negatively correlated. Interestingly, sulphates showed a stronger correlation on the logarithmic scale (0.31 versus 0.25), and pH was hardly correlated at all. The NaN reported for citric acid on the logarithmic scale is consistent with its zero values, whose logarithm is not finite.
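The effect of zero values on a log-scale correlation can be illustrated with a small sketch. The two vectors below are the first ten citric.acid and quality values from the `str(wq)` output, so the coefficients are not the ones reported above; filtering to finite values is one possible workaround and is an assumption, not something the original analysis did.

```r
# First ten values of each column, copied from the str(wq) output above.
citric  <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Plain-scale Pearson correlation.
r_plain <- cor(citric, quality)

# log10(0) is -Inf, so the log-scale correlation is undefined.
log_citric <- log10(citric)
r_log_all  <- cor(log_citric, quality)

# Possible workaround (assumption): keep only rows with a finite logarithm.
keep         <- is.finite(log_citric)
r_log_finite <- cor(log_citric[keep], quality[keep])

print(c(plain = r_plain, log_all = r_log_all, log_finite = r_log_finite))
```

Dropping the zero rows changes which wines are being compared, so a coefficient computed this way is not directly comparable with the plain-scale one.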
From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The correlation tests showed similar trends, with two exceptions: pH had a correlation coefficient of only about -0.06, and sulphates had a stronger coefficient of about 0.31 on the logarithmic scale.
The acidity and sulphur dioxide relationships were examined.
There seems to be a trend between fixed acidity and citric acid, and between volatile acidity and citric acid, but oddly there seems to be no relationship between fixed acidity and volatile acidity. It could be that the underlying chemistry of the two does not depend on each other.
As a positive control, the relationship between acidity on a logarithmic scale and pH was examined.
## cor
## -0.7044435
As expected, the higher the acidity, the lower the pH value, with a correlation coefficient of about -0.70.
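A control of this kind can be sketched with `cor.test()`, whose estimate is labelled `cor` in the same way as the output above. The data frame `wq_head` is a hypothetical stand-in built from the first ten rows in the `str(wq)` output, so the coefficient will differ from the full-dataset value of -0.704.

```r
# First ten rows of the relevant columns, copied from the str(wq) output above.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Combined acidity, as defined earlier in the report.
wq_head$acidity <- with(wq_head, fixed.acidity + volatile.acidity + citric.acid)

# Pearson correlation between log10(acidity) and pH.
control <- cor.test(log10(wq_head$acidity), wq_head$pH)

print(control$estimate)
```

Since pH is itself a negative logarithm of hydrogen ion concentration, taking log10 of acidity puts both variables on comparable scales, and a negative sign is the expected result.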
The relationship between free and total sulphur dioxide was investigated.
## cor
## 0.6676665
A correlation coefficient of about 0.67 indicates a fairly strong relationship between the two sulphur dioxide measures. Some research indicates that sulphur dioxide is used as an antimicrobial in winemaking and that free sulphur dioxide is one part of the total.
The strongest relationships to quality, expressed as correlation coefficients, were as follows:
- alcohol: 0.476
- sulphates (log10): 0.309
- citric acid: 0.226
- fixed acidity: 0.124
- volatile acidity: -0.391
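To compare these relationships by strength regardless of sign, the coefficients can be ordered by absolute value. The sketch below copies the five values from the correlation output above; the vector name and labels are illustrative.

```r
# Coefficients copied from the correlation output above
# (sulphates taken from the log10 run, the rest from the plain run).
r_quality <- c(
  alcohol          = 0.47616632,
  sulphates.log10  = 0.30864193,
  citric.acid      = 0.22637251,
  fixed.acidity    = 0.12405165,
  volatile.acidity = -0.39055778
)

# Order by magnitude so that strong negative relationships rank highly too.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

print(round(ranked, 3))
```

Ordered this way, volatile acidity ranks second only to alcohol, even though its sign is negative.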
For the multivariate plots, the features with the strongest relationship to quality were examined by splitting the plots by quality score and faceting them by the three rating categories. It can be noted that higher alcohol, sulphates, citric acid, and fixed acidity, together with lower volatile acidity, are associated with better wine quality. This is in line with the insights uncovered thus far.
Since alcohol, specifically ethanol, is a very weak acid, it was thought to be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid above clearly shows their lack of correlation with each other.
To close off the discussion around pH, it can be visually observed not to be a driver of wine quality when compared with the very obvious alcohol variable. It should be noted, though, that pH depends on the concentration of acids in wine, and as such doesn’t seem to vary far from the 3-4 range.
From the numerous plots above, it can be found that acidity, alcohol content and sulphates contribute to good wines. The final plots will illustrate these findings.
It can be noted that not all acids are created equal. These boxplots illustrate that higher fixed acidity (tartaric acid) and citric acid are found in better quality wines. Furthermore, lower volatile acidity (acetic acid) is also associated with higher wine quality. Therefore, a lower pH alone would be a red herring for wine quality. After all, a higher acid concentration leads to a lower pH value, but only tartaric and citric acid seem to benefit wine quality.
These boxplots show a trend of higher wine quality ratings with higher alcohol content. While it is likely that a higher caliber wine would have a higher percentage of alcohol, additional experimentation is needed to support causation, particularly given the outliers in the Average category.
This final plot illustrates that good wines tend to have high sulphates and high alcohol at the same time. The dotted lines represent the mean of each axis, and the top right quadrant has a high density of Good wine ratings.
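The quadrant reading of the plot can be expressed as a simple logical test: a wine falls in the top right quadrant when both its sulphates and its alcohol exceed their respective means. The sketch below uses only the first ten rows from the `str(wq)` output, with means taken over those ten rows, so it shows the mechanics rather than the pattern itself.

```r
# First ten rows of the relevant columns, copied from the str(wq) output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
alcohol   <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# A wine is in the top right quadrant when it is above the mean on both axes.
top_right <- sulphates > mean(sulphates) & alcohol > mean(alcohol)

# Cross-tabulate quadrant membership against the Good rating (score of 7+).
quadrant_table <- table(good = quality >= 7, top_right = top_right)

print(which(top_right))
print(quadrant_table)
```

Ten rows are far too few to show the concentration of Good wines described above; on the full dataset the same test would be applied to all 1,599 observations.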
Exploratory data analysis proved to be very effective in understanding relationships within the red wine quality dataset. There were no notable struggles encountered throughout this analysis. It was found that fixed acidity, citric acid, alcohol content and sulphates are positively associated with wine quality, while volatile acidity is negatively associated with it. Boxplots seemed to be the most telling visualization for this dataset.
It should be noted, though, that wine quality is highly subjective and depends on an individual’s taste; a better study would include the quantities of each wine sold in the market. Further analysis using inferential statistics and similar methodologies should be used to verify the findings in this exploration. Nevertheless, the plots here did uncover an interesting and telling story of wine quality in the available observations.